A RelEntLess Benchmark for Modelling Graded Relations between Named Entities
Relations such as "is influenced by", "is known for" or "is a competitor of"
are inherently graded: we can rank entity pairs based on how well they satisfy
these relations, but it is hard to draw a line between those pairs that satisfy
them and those that do not. Such graded relations play a central role in many
applications, yet they are typically not covered by existing Knowledge Graphs.
In this paper, we consider the possibility of using Large Language Models
(LLMs) to fill this gap. To this end, we introduce a new benchmark, in which
entity pairs have to be ranked according to how much they satisfy a given
graded relation. The task is formulated as a few-shot ranking problem, where
models only have access to a description of the relation and five prototypical
instances. We use the proposed benchmark to evaluate state-of-the-art relation
embedding strategies as well as several recent LLMs, covering both publicly
available LLMs and closed models such as GPT-4. Overall, we find a strong
correlation between model size and performance, with smaller language models
struggling to outperform a naive baseline. The results of the largest Flan-T5
and OPT models are remarkably strong, although a clear gap with human
performance remains.
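To make the few-shot ranking setup concrete, the sketch below scores candidate entity pairs by the average log-likelihood a causal LM assigns to a verbalised statement, conditioned on a prompt containing the relation and prototypical pairs. This is a minimal illustration rather than the paper's evaluation code; the model choice, prompt wording and scoring function are all assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # any causal LM; OPT is one family the paper evaluates
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def pair_score(prompt: str, head: str, tail: str) -> float:
    """Mean token log-probability of the verbalised statement given the prompt."""
    statement = f'"{head}" is influenced by "{tail}".'
    ids = tokenizer(prompt + "\n" + statement, return_tensors="pt").input_ids
    prompt_len = tokenizer(prompt + "\n", return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    # log P(token_i | tokens_<i) for every position, then keep the statement's
    # tokens only (the boundary is approximate, which is fine for a sketch)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[0, prompt_len - 1:].mean().item()

prompt = (
    'Complete the list of pairs satisfying the relation "is influenced by":\n'
    '"Schopenhauer" is influenced by "Kant".'
    # ...the remaining four prototypical pairs would be listed here...
)
candidates = [("Nietzsche", "Schopenhauer"), ("Nietzsche", "Messi")]
for pair in sorted(candidates, key=lambda p: pair_score(prompt, *p), reverse=True):
    print(pair)
```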
An Efficient Multilingual Language Model Compression through Vocabulary Trimming
Multilingual language models (LMs) have become a powerful tool in NLP,
especially for non-English languages. Nevertheless, the parameters of
multilingual LMs remain large due to the large embedding matrix needed to
cover tokens in many different languages. In contrast, monolingual LMs can be
trained in a target language with a language-specific vocabulary only, but
training a high-quality LM from scratch requires a large budget and the
availability of reliable corpora. In this paper, we propose
vocabulary-trimming (VT), a method to reduce a multilingual LM vocabulary to a
target language by deleting irrelevant tokens from its vocabulary. In theory,
VT can compress any existing multilingual LM to build monolingual LMs in any
language covered by the multilingual LM. In our experiments, we show that VT
retains the original performance of the multilingual LM while reducing its
size: in general, around 50% of the original vocabulary suffices. The
evaluation covers four NLP tasks (two generative and two classification tasks)
across four widely used multilingual LMs in seven languages. Finally, we show
that this methodology keeps the best of both the monolingual and multilingual
worlds: the trimmed models remain as small as monolingual models without
having to be retrained from scratch, and trimming can even limit potentially
harmful social biases.
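The core of the trimming idea can be sketched with standard Hugging Face APIs: tokenise a target-language corpus, keep only the token ids that actually occur (plus special tokens), and slice the embedding matrix down to those rows. This is a simplified illustration rather than the authors' released implementation; the full method must also remap the tokenizer and any tied output head consistently, which is glossed over here.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# A tiny stand-in for a target-language corpus; in practice this would be a
# large monolingual corpus in the target language.
corpus = ["Ceci est un exemple de phrase en français.",
          "Le vocabulaire inutile sera supprimé."]

keep = set(tokenizer.all_special_ids)
for text in corpus:
    keep.update(tokenizer(text)["input_ids"])
keep = sorted(keep)

# Slice the input embedding matrix down to the kept rows and swap it in.
old_emb = model.get_input_embeddings()
new_emb = torch.nn.Embedding(len(keep), old_emb.embedding_dim)
new_emb.weight.data = old_emb.weight.data[keep].clone()
model.set_input_embeddings(new_emb)
model.config.vocab_size = len(keep)
print(f"vocabulary: {old_emb.num_embeddings} -> {len(keep)} tokens")
```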
A Practical Toolkit for Multilingual Question and Answer Generation
Generating questions along with associated answers from a text has
applications in several domains, such as creating reading comprehension tests
for students, or improving document search by providing auxiliary questions and
answers based on the query. Training models for question and answer generation
(QAG) is not straightforward because of the expected structured output (i.e. a
list of question and answer pairs), which requires more than generating a
single sentence. As a result, few QAG models are publicly accessible. In
this paper, we introduce AutoQG, an online service for multilingual QAG, along
with lmqg, an all-in-one Python package for model fine-tuning, generation, and
evaluation. We also release QAG models in eight languages fine-tuned on a few
variants of pre-trained encoder-decoder language models, which can be used
online via AutoQG or locally via lmqg. With these resources, practitioners of
any level can benefit from a toolkit that includes a web interface for end
users, and easy-to-use code for developers who require custom models or
fine-grained controls for generation.
Comment: Accepted by ACL 2023 System Demonstrations
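As a quick illustration, generating question-answer pairs locally with lmqg looks roughly like the following; the class and method names reflect a reading of the package documentation and should be checked against the released version.

```python
from lmqg import TransformersQG

# Load a QAG model for English (a fine-tuned encoder-decoder LM under the hood).
model = TransformersQG(language="en")

context = (
    "William Turner was an English painter who specialised in watercolour "
    "landscapes. He is often known as William Turner of Oxford."
)
# generate_qa returns a list of (question, answer) pairs for the paragraph.
for question, answer in model.generate_qa(context):
    print(question, "->", answer)
```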
Back to the basics: a quantitative analysis of statistical and graph-based term weighting schemes for keyword extraction
Term weighting schemes are widely used in Natural Language Processing and Information Retrieval. In particular, term weighting is the basis for keyword extraction. However, there are relatively few evaluation studies that shed light on the strengths and shortcomings of each weighting scheme. In fact, in most cases researchers and practitioners resort to the well-known tf-idf as a default, despite the existence of other suitable alternatives, including graph-based models. In this paper, we perform an exhaustive and large-scale empirical comparison of both statistical and graph-based term weighting methods in the context of keyword extraction. Our analysis reveals some interesting findings, such as the advantages of the less-known lexical specificity with respect to tf-idf, or the qualitative differences between statistical and graph-based methods. Finally, based on our findings, we discuss and devise some suggestions for practitioners.
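For readers unfamiliar with lexical specificity, one common formulation scores a term by how unlikely its in-document frequency is under a hypergeometric model of drawing the document's tokens from the corpus. A minimal sketch of that formulation (our own, not the paper's code):

```python
import math
from scipy.stats import hypergeom

def lexical_specificity(f: int, t: int, F: int, T: int) -> float:
    """-log10 P(X >= f), where X ~ Hypergeom(population=T, successes=F, draws=t).

    f: term frequency in the document, t: document length in tokens,
    F: term frequency in the corpus,   T: corpus length in tokens.
    """
    p = hypergeom.sf(f - 1, T, F, t)  # sf(f - 1) = P(X >= f)
    return math.inf if p == 0.0 else -math.log10(p)

# A term seen 12 times in a 1,000-token document but only 40 times in a
# 1,000,000-token corpus is highly specific to that document:
print(lexical_specificity(f=12, t=1000, F=40, T=1_000_000))
```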
Twitter Topic Classification
Social media platforms host discussions about a wide variety of topics that
arise every day. Making sense of all the content and organising it into
categories is an arduous task. A common way to deal with this issue is to rely
on topic modeling, but topics discovered using this technique are difficult to
interpret and can differ from corpus to corpus. In this paper, we present a new
task based on tweet topic classification and release two associated datasets.
Given a wide range of topics covering the most important discussion points in
social media, we provide training and testing data from recent time periods
that can be used to evaluate tweet classification models. Moreover, we perform
a quantitative evaluation and analysis of current general- and domain-specific
language models on the task, which provides more insight into the challenges
and nature of the task.
Comment: Accepted at COLING 2022
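A minimal evaluation loop for the task could look like the following; the dataset and model identifiers below are illustrative assumptions, so they should be replaced with the names of the resources actually released with the paper.

```python
from datasets import load_dataset
from transformers import pipeline

# Both identifiers below are placeholders for the released resources.
dataset = load_dataset("cardiffnlp/tweet_topic_single", split="test_2021")
classifier = pipeline("text-classification", model="cardiffnlp/tweet-topic-21-single")

labels = dataset.features["label"]  # ClassLabel: maps ids to topic names
n = 100  # evaluate on a small sample for illustration
correct = 0
for example in dataset.select(range(n)):
    pred = classifier(example["text"])[0]["label"]
    correct += int(pred == labels.int2str(example["label"]))
print(f"accuracy on {n} test tweets: {correct / n:.2%}")
```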
Generative Language Models for Paragraph-Level Question Generation
Powerful generative models have led to recent progress in question generation (QG). However, it is difficult to measure advances in QG research since there are no standardized resources that allow a uniform comparison among approaches. In this paper, we introduce QG-Bench, a multilingual and multidomain benchmark for QG that unifies existing question answering datasets by converting them to a standard QG setting. It includes general-purpose datasets such as SQuAD for English, datasets from ten domains and two styles, as well as datasets in eight different languages. Using QG-Bench as a reference, we perform an extensive analysis of the capabilities of language models for the task. First, we propose robust QG baselines based on fine-tuning generative language models. Then, we complement automatic evaluation based on standard metrics with an extensive manual evaluation, which in turn sheds light on the difficulty of evaluating QG models. Finally, we analyse both the domain adaptability of these models and the effectiveness of multilingual models in languages other than English. QG-Bench is released along with the fine-tuned models presented in the paper (https://github.com/asahi417/lm-question-generation), which are also available as a demo (https://autoqg.net/).
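To illustrate the conversion that QG-Bench performs, the sketch below turns a SQuAD-style (context, question, answer) triple into a QG training pair by highlighting the answer span in the input. The <hl> markers and task prefix follow common practice in answer-aware QG; the released code should be consulted for the exact convention used in the benchmark.

```python
from datasets import load_dataset

def qa_to_qg(example: dict) -> dict:
    """Convert a SQuAD-style QA example into a (model input, target question) pair."""
    context = example["context"]
    answer = example["answers"]["text"][0]
    start = example["answers"]["answer_start"][0]
    end = start + len(answer)
    # Mark the answer span so the model knows which question to generate.
    highlighted = context[:start] + "<hl> " + answer + " <hl>" + context[end:]
    return {"input": "generate question: " + highlighted,
            "target": example["question"]}

squad = load_dataset("squad", split="validation")
print(qa_to_qg(squad[0]))
```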